Abstract

Byzantine Fault Tolerant (BFT) systems are considered by
the systems research community to be state of the art with
regard to providing reliability in distributed systems. BFT
systems provide safety and liveness guarantees with reason- 
able assumptions, amongst a set of nodes where at most 
f nodes display arbitrarily incorrect behaviors, known as 
Byzantine faults. Despite this, BFT systems are still rarely 
used in practice. In this paper we describe our experience, 
from an application developer's perspective, trying to lever- 
age the publicly available and highly-tuned "PBFT" middle- 
ware (by Castro and Liskov), to provide provable reliabil- 
ity guarantees for an electronic voting application with high 
security and robustness needs. The PBFT middleware has 
been the focus of most BFT research efforts over the past 
twelve years; all direct descendant systems depend on its
initial code base.

We describe several obstacles we encountered and draw- 
backs we identified in the PBFT approach. These include 
some that we tackled, such as lack of support for dynamic 
client management and leaving state management com- 
pletely up to the application. Others still remaining include 
the lack of robust handling of non-determinism, lack of sup- 
port for web-based applications, lack of support for stronger 
cryptographic primitives, and others. We find that, while 
many of the obstacles could be overcome with a revised 
BFT middleware implementation that is tuned specifically 
for the needs of the particular application, they require
significant engineering effort and time, and their performance
implications for the end application are unclear. An application
developer is thus unlikely to be willing to invest the
time and effort to do so to leverage the BFT approach. We
conclude that the research community needs to focus on the 
usability of BFT algorithms for real world applications, from 
the end-developer perspective, in addition to continuing to 
improve the BFT middleware performance, robustness and
deployment layouts. 

1. Introduction 

Byzantine Fault Tolerant (BFT) systems are considered by
the systems research community to be state of the art with
regard to providing reliability in distributed systems. A BFT
system implements a replicated state machine [28] typically
consisting of n = 3f + 1 replica servers that each provide
a finite state machine and execute operations from clients
in the same order. BFT systems assume a pessimistic failure
model, based on the classic Byzantine generals' problem,
which provides agreement amongst a set of nodes
where at most f nodes display arbitrarily incorrect behaviors,
known as Byzantine faults.

BFT systems are attractive because they provide guaranteed
safety and liveness properties when the assumption of
up to f faulty nodes holds. Early work on BFT systems was
widely considered to be impractical for use by real systems
because it was either too slow to be used in practice or
assumed synchronous environments that rely on known message
delay bounds. However, the seminal work of Castro and
Liskov [8], published in 1999, changed this view. This work
proposed and implemented Practical Byzantine Fault Tolerance,
achieving an impressive peak throughput of several tens of
thousands of (null) operations per second, previously thought
unattainable. As has been noted by others, over the last
twelve years the research community has seen a flurry of
excitement, with several efforts to improve the performance
and/or cost of BFT replication systems. These efforts include
studies aimed at increasing throughput or reducing latency
of client requests [4, 11, 12, 14, 16, 20, 21, 31-33], efforts to
reduce the number of replica servers needed to withstand f
faults to achieve lower replication cost [32, 33], and efforts
to boost the robustness of the protocol under both faulty
servers and faulty clients. A majority of these systems
are direct descendants of the
Castro and Liskov system, hereafter referred to as the PBFT
approach (for Practical Byzantine Fault Tolerance). Both the
implementations and evaluations of these systems depend on
the initial PBFT code base.

Despite PBFT's attractive correctness guarantees, BFT systems
are still rarely used in practice. This is unfortunate, given the
ever-increasing need for reliability in real-world distributed
systems. More and more applications require the utmost security
and reliability to be both trustworthy to users and successful
in use (e.g., electronic voting and digital preservation).
The lack of wide deployment of state-of-the-art BFT
technologies is also puzzling. The open-source PBFT code
initially provided by Castro and later modified by others has
been publicly available, improved, and fine-tuned for several
years, and while readily sized up by the academic community
for research purposes, it has not been used in practice in
real-world systems.

In this paper, we examine, from the perspective of an ap- 
plication developer, the practicality, i.e., feasibility, of us- 
ing the PBFT protocol and accompanying implementation 
to provide provable reliability guarantees for a real-world 
application. Our motivating application is a state-of-the-art
electronic voting system, offered as a public Internet service.
The current version is centralized. Given the critical nature
of the application, our aim is to build a system that has
no centralized component. Every aspect of the system's de- 
sign should be distributed to avoid single points of attack 
and failure. Our aim is to leverage the correctness guarantees 
provided by PBFT systems to improve the security and reli- 
ability properties of the system. In such a system, clients (on 
behalf of users/voters) connect to the voting service, view 
the election procedures to which they have a right to par- 
ticipate, send the user's vote, and potentially reconnect at a 
later point to view the progress and/or results of the elec- 
tion. Our aim has been to gauge, from the perspective of 
a developer in need of providing reliability beyond simple 
crash-fault recovery, how easily the PBFT approach and ac- 
companying system could be molded to fit the application 
developer's needs. 

We have focused on the original PBFT implementation 
for several reasons. First, over the past twelve years, the 
majority of research efforts on improving BFT systems have 
relied on the PBFT approach and implementation. This is the 
most stable code base that is publicly available and has been 
fine-tuned and improved over several years and by several 
developers. Second, even as the debate over improving BFT 
systems continues, the interface to application developers 
provided by the PBFT middleware remains the same. This 
means that any later developments in the PBFT system suite 
can be easily leveraged by applications. Since later systems 
are not as fine-tuned as the original PBFT code base, we have 
chosen the original for more stability. Third, our particular 
electronic voting application is written in C; the PBFT code 
base is written in C++. A recent effort called UpRight,
aimed at easing the application developer's effort to make
use of BFT technology, is written in Java, still has several key
features missing (e.g., view changes are unimplemented),
and seems to be a work in progress that has not seen much
development in the last year and a half. Thus, for a developer
wanting to leverage the attractive reliability guarantees of
BFT now, the original PBFT system offers the most promise. 

We describe our experience trying to leverage the PBFT 
approach and code base to enhance the reliability of our 
e-voting application. We describe several obstacles we encountered
and drawbacks we identified in the PBFT approach.
One key drawback we identified is that PBFT-based
systems assume static membership, i.e., clients and replica
servers know each other a priori, before system initialization.
Most Internet services require support for dynamic client 
management, particularly when the number of envisioned 
clients is large. The PBFT literature (original as well as all 
subsequent descendants of PBFT) does not address this is- 
sue. Another key drawback is that PBFT leaves state man- 
agement completely to the application developer, who is re- 
quired to manually manage a raw memory region, while also 
issuing notifications to the library before changing memory 
contents. This may be fine when developing system services, 
but is not a very convenient base for an application. Additionally,
PBFT treats a replica server's memory as stable
storage, by assuming the use of uninterruptible power supplies [8].
Many Internet application services, particularly an
electronic voting system, cannot afford to rely on this assumption
and instead require traditional ACID semantics to
ensure data stored is consistent and persists despite crashes
and faults. The PBFT system suite leaves state management
to the application developer. This means that an application
developer wishing to make use of an available legacy
database to provide the required ACID semantics is faced 
with the decision of implementing from scratch these seman- 
tics into the application or retrofitting the BFT middleware 
to interface with and support the legacy database. 

In addition to the above, we describe a number of other
drawbacks, including the mechanism used by PBFT to handle
nondeterminism in applications, the lack of support for
stronger cryptography, the lack of support for web-based
applications, and others. The description of our experience may
seem pedantic, with many minute low-level details, but we
provide these here to give the reader a clear understanding,
from a holistic systems perspective, of the obstacles faced
by a developer trying to put the PBFT system to real, practical
use. These are details that are often considered "not important
enough" to warrant attention and space in many research
papers (and prototype implementations, for that matter),
usually due to time and space constraints. Nonetheless,
they can trip up a third-party developer hoping to make use
of the novel research prototype. In practice, it is the details
that make or break the widespread deployment and use of a
system.

We find that while many of the obstacles we describe
could be overcome with a "better" or "revised" BFT middleware
implementation that is tuned specifically for the needs
of the particular application, they require significant engineering
effort and time. Even less encouraging is the fact
that the performance implications of the changes required to
meet the application's needs are unclear. For example, we
describe how we overcome the first two drawbacks above.
While adding support for dynamic client management does
not significantly affect system performance, measured in
null operations per second, retrofitting the PBFT middleware
to support a legacy database reveals a throughput
of real operations that is two orders of magnitude
smaller than that of the null ones advertised by prior BFT studies.

To date, only two publications on BFT that we are aware 
of have noted that reporting null operations per second as 
throughput is not representative of real applications and thus 
not helpful to the end developer. This is understandable,
as the focus of most BFT research efforts has not been
on end-application use but on improving the BFT middleware
itself. Nonetheless, a developer faced with having to
make a slew of modifications to the BFT middleware to get 
an end-system that has unknown performance properties is 
hesitant to invest the effort to do so. 

This paper makes the following contributions: 

• We identify a number of drawbacks in the PBFT protocol 
suite, from the perspective of an end-application devel- 
oper trying to leverage PBFT reliability guarantees and 
we describe a number of potential solutions to address 
these. The sheer number of drawbacks severely affects 
the ease with which a developer can leverage the PBFT 
approach. 

• We present changes we made to the PBFT protocol and 
implementation to enable dynamic client management, 
a must for many Internet service applications in use to- 
day. We show that these changes can be made with mini- 
mal additions to the PBFT protocol, thus not affecting its 
provable reliability guarantees. We demonstrate, via em- 
pirical experiments, that support for dynamic client man- 
agement can be achieved with minimal performance im- 
pact. 

• We evaluate the performance impact of retrofitting the
PBFT middleware to support ACID semantics via a
widely-used legacy database to ease the state management
burden of many applications requiring these semantics.
We show that for non-null operations, the
throughput can be many times smaller than the tens of
thousands of null operations per second presented in prior
PBFT-based studies.

2. Background 
2.1 Original algorithm 

The Castro-Liskov algorithm for Practical Byzantine Fault
Tolerance [8] (abbreviated as PBFT) is a replication algorithm
that can tolerate arbitrary faults. It is based on State
Machine Replication [28], where transitions are applied
to an instance of the application's state and result in a new,
deterministic instance of the state. The general idea is that a
group of replicas form a static group that provides a service.
At each instant in time, one of them is the primary and is
responsible for sequencing the requests, providing total order.
This in turn guarantees linearizability, which is a
correctness condition for concurrent objects where a concurrent
computation is equivalent to a legal sequential computation.
A view is the epoch during which the primary is stable. The
remaining replicas monitor client requests and the primary's
behavior and, if the latter is found misbehaving, begin a view
change procedure and elect a new primary.

The algorithm is asynchronous and provides liveness
and safety guarantees when fewer than a third of the replicas
are faulty. More specifically, to tolerate f Byzantine faults,
the group needs at least 3f + 1 members. Safety, formally
proved by using the I/O Automaton model [25], guarantees
that replies will be correct according to linearizability. Liveness
assures that clients will eventually receive replies to
their requests. The algorithm does not rely on synchrony to
provide safety but does rely on a weak synchrony assumption
to provide liveness: that delay(t) does not grow faster
than t indefinitely. Here, delay(t) represents the time interval
between initial message transmission (t) and message
delivery to the replica process. For the protocol to be live,
the client is expected to keep retransmitting its request until
it finally obtains the reply. Further assumptions include
independence of node failures and inability of an attacker to
subvert cryptographic protocols.
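
The group-size requirement above can be made concrete with a few lines of arithmetic. This is only an illustrative sketch; the function names are ours and are not part of the PBFT library.

```python
def min_replicas(f: int) -> int:
    """Smallest group size that tolerates f Byzantine replicas (n = 3f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults(n: int) -> int:
    """Largest f such that a group of n replicas satisfies n >= 3f + 1."""
    if n < 1:
        raise ValueError("n must be positive")
    return (n - 1) // 3


# Adding replicas only helps once a full multiple of three has been added.
for n in range(4, 11):
    print(f"n={n}: tolerates f={max_faults(n)}")
```

Note that growing a group from 4 to 6 replicas does not raise the number of tolerated faults; the seventh replica does.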

In normal operation, the client sends a request to the primary.
The primary assigns a monotonically increasing sequence
number to the request and begins a 3-phase agreement
protocol with the other replicas, at the end of which
each node executes the request and directly transmits the reply
to the client. The latter will accept the reply as correct
only when f + 1 replies match. The 3-phase protocol consists
of the exchange of the following messages, where the
target of a multicast is the set of replicas:

1. Pre-prepare, multicast from the primary, which assigns a
sequence number to a request and forwards its contents

2. Prepare, multicast by each replica, agreeing to the se- 
quence number assignment 

3. Commit, multicast by each replica, which helps guarantee 
total ordering across views 

After the commit, each replica will execute the request
and transmit the reply directly to the client. In all above message
exchanges, the sender is expected to sign the contents
with its private key.

This operation is depicted in Figure 1.
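
The bookkeeping behind the three phases can be sketched as follows. This is a simplified model, not the library's code: the class and function names are hypothetical, signatures and views are omitted, and the thresholds of 2f matching prepares and 2f + 1 matching commits are taken from the published PBFT protocol rather than from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Slot:
    """Agreement state that one replica keeps for one sequence number."""
    f: int
    digest: Optional[str] = None                  # fixed by the pre-prepare
    prepares: Set[int] = field(default_factory=set)
    commits: Set[int] = field(default_factory=set)

    def on_pre_prepare(self, digest: str) -> None:
        # Accept only the first assignment for this sequence number.
        if self.digest is None:
            self.digest = digest

    def on_prepare(self, replica_id: int, digest: str) -> None:
        # Count a prepare only if it agrees with the pre-prepared digest.
        # (A real replica would buffer prepares that arrive early.)
        if self.digest is not None and digest == self.digest:
            self.prepares.add(replica_id)

    def on_commit(self, replica_id: int, digest: str) -> None:
        if self.digest is not None and digest == self.digest:
            self.commits.add(replica_id)

    def prepared(self) -> bool:
        return self.digest is not None and len(self.prepares) >= 2 * self.f

    def committed(self) -> bool:
        return self.prepared() and len(self.commits) >= 2 * self.f + 1


def accepted_reply(replies, f):
    """Client side: return a result reported by at least f + 1 replicas, else None.

    `replies` maps replica id -> result.
    """
    counts = {}
    for result in replies.values():
        counts[result] = counts.get(result, 0) + 1
        if counts[result] >= f + 1:
            return result
    return None
```

With f = 1, a slot becomes prepared after two matching prepares and committed after three matching commits, and a client accepts a result once two replicas report the same value.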

Figure 1. Normal PBFT operation

Certain optimizations were applied by Castro and Liskov
to this basic mode of operation in order to improve the latency
and throughput of the system. First of all, the use of
asymmetric cryptography was reduced by introducing Message
Authentication Codes (MACs). The client assigns a different
key to each replica and sends the key to it, encrypted with the
replica's public key. From then on, all requests are accompanied
by an 'authenticator', which is a structure that contains
one MAC for each replica. This considerably boosted performance,
as we confirm in Section 4. Another optimization
is the tentative execution of requests before the commit
phase. The client cooperates in this mode of operation,
as it expects 2f+1 tentative replies (marked as such by each
replica) instead of the normal f+1. If such a quorum is not assembled,
the client simply retransmits the request message.
As the replicas will in turn retransmit the last reply for this
client (which by now should be marked as stable, since the
Commit phase should be over), a smaller quorum of f+1 stable
(non-tentative) replies may be enough.
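
The authenticator structure can be illustrated with a short sketch. It is an assumption-laden model: HMAC-SHA256 from the Python standard library stands in for the UMAC32 primitive used by the actual implementation, the function names are hypothetical, and key distribution is reduced to a local dictionary.

```python
import hashlib
import hmac
import os


def make_session_keys(replica_ids):
    """Client side: choose one fresh random key per replica."""
    return {rid: os.urandom(16) for rid in replica_ids}


def make_authenticator(keys, message: bytes):
    """Return {replica_id: MAC}: one MAC per replica over the same message."""
    return {rid: hmac.new(key, message, hashlib.sha256).digest()
            for rid, key in keys.items()}


def verify_entry(replica_id, key, message: bytes, authenticator) -> bool:
    """Replica side: check only this replica's own entry.

    A replica holds just its own key, so it cannot check the other entries.
    """
    tag = authenticator.get(replica_id)
    if tag is None:
        return False
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)
```

A replica that no longer holds its session key cannot validate its entry, which is why loss of these keys matters for the recovery behavior discussed in Section 2.3.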

Yet another optimization is the special treatment of read-only
and big requests. A request is considered big if its size
exceeds a configurable threshold, while the read-only status
is explicitly set by the client. These differentiated requests
are multicast from the client to all replicas, to relieve the
primary of this burden. This mechanism is used by default
to the maximum extent, by setting the threshold to
0, which results in all requests being treated as big. Read-only
requests are treated specially and are executed as soon as they
are received, sequencing permitting. Finally, request
batching is employed to minimize network usage and
agreement latency. A congestion window is defined as the
number of requests that have been received but not yet executed
by the primary; its size is an adjustable parameter of
the system. When the primary receives a request message,
it calculates the difference between the last locally executed
sequence number and the sequence number assigned to the
new request. If this difference exceeds the defined congestion
window, it postpones issuing the pre-prepare message,
giving itself time to catch up on request execution. Once it
does, it includes in a single pre-prepare message as many
outstanding request messages as possible, thus minimizing
latency due to individual agreement. Note that batched requests
capture parallelism from different clients, as each
client is allowed only a single outstanding request.
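
The batching rule can be sketched as a small model of the primary. The class name, the `max_batch` cap, and the `sent` list that stands in for multicasting pre-prepare messages are all assumptions made for illustration.

```python
from collections import deque


class BatchingPrimary:
    """Toy model of the primary's batching decision."""

    def __init__(self, window: int, max_batch: int):
        self.window = window          # congestion window, in sequence numbers
        self.max_batch = max_batch    # hypothetical cap on requests per pre-prepare
        self.next_seq = 1             # sequence number of the next pre-prepare
        self.last_executed = 0        # last sequence number executed locally
        self.pending = deque()        # requests received but not yet ordered
        self.sent = []                # (seq, [requests]) pre-prepares issued

    def on_request(self, request) -> None:
        self.pending.append(request)
        self._try_send()

    def on_executed(self, seq: int) -> None:
        self.last_executed = max(self.last_executed, seq)
        self._try_send()

    def _try_send(self) -> None:
        # Postpone while the new sequence number would run further ahead of
        # local execution than the congestion window allows.
        while self.pending and self.next_seq - self.last_executed <= self.window:
            size = min(self.max_batch, len(self.pending))
            batch = [self.pending.popleft() for _ in range(size)]
            self.sent.append((self.next_seq, batch))
            self.next_seq += 1
```

With a window of 1, a first request is ordered immediately, while requests arriving before it has executed accumulate and are later ordered together under a single sequence number.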



An implementation of the protocol was developed by the 
author, Miguel Castro, and published as open-source along 
with his dissertation. The environment chosen was: 

• Linux as the platform (but mostly POSIX compliant) 

• C++ as the base language 

• UDP as the network protocol 

• An implementation of the Rabin cryptosystem for asym- 
metric cryptography 

• An implementation of UMAC32 for MAC operations 

• An implementation of MD5 for digests 

This implementation defines application "state" as a single
continuous virtual memory region. In fact, it splits this
region in two, the first part for the internal library needs and
the remainder for the application. The library has a subsystem
that manages the synchronization and checkpointing of
this state using copy-on-write techniques and Merkle (hash)
trees. The general idea is that the state is divided into pages
of equal length. A hash tree is formed where the leaves are
the actual data pages while the inner nodes are the hashes of
their children (either of the data pages at level height-1, or of
other hashes at smaller depths). At the root, a single digest
uniquely identifies the complete memory region. A checkpoint
message communicates this root hash to the rest of the
replicas, to agree that the state is properly synchronized. If
a peer finds itself out of sync, an efficient tree-walking algorithm
is started from the root, to identify the (hopefully few)
data pages that are different and have them retransmitted by
the rest of the group.
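
The hash-tree comparison can be sketched as below. This is a simplified illustration under several assumptions: the tree is binary, SHA-256 replaces the MD5 digests of the actual implementation, the page size is arbitrary, and both trees are held locally, whereas a real replica would fetch the remote digests over the network.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page length


def split_pages(state: bytes, page_size: int = PAGE_SIZE):
    return [state[i:i + page_size] for i in range(0, len(state), page_size)]


def build_tree(pages):
    """Return a list of levels: levels[0] holds page digests, levels[-1] the root."""
    if not pages:
        raise ValueError("state must contain at least one page")
    level = [hashlib.sha256(p).digest() for p in pages]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                       # pad odd levels with the last digest
            level = level + [level[-1]]
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_pages(mine, theirs):
    """Walk both trees from the root, descending only where digests differ."""
    if [len(l) for l in mine] != [len(l) for l in theirs]:
        raise ValueError("trees must cover the same number of pages")
    top = len(mine) - 1
    frontier = [0] if mine[top][0] != theirs[top][0] else []
    for depth in range(top, 0, -1):
        below_mine, below_theirs = mine[depth - 1], theirs[depth - 1]
        frontier = [child
                    for parent in frontier
                    for child in (2 * parent, 2 * parent + 1)
                    if child < len(below_mine)
                    and below_mine[child] != below_theirs[child]]
    return frontier
```

Only the subtrees whose digests differ are visited, so a single modified page is found after inspecting a number of digests proportional to the tree height rather than to the number of pages.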

The server part of an application wishing to use PBFT
services is expected to initialize the library and then wait
for up-calls from it, to service requests and produce replies. 
While executing, it has free read access to arbitrary memory 
regions inside the "state" managed by PBFT, but is expected 
to notify the library before making any changes. 

2.2 Reasoning about the default implementation 

It is very hard to reason about the behavior of a distributed 
system when it is run on multiple hosts, without a common 
clock. Although solutions such as vector clocks exist, it 
would be too intrusive to retrofit them in the existing library, 
just for the sake of monitoring its operation. To this end, we 
modified the library to be able to run multiple times on the 
same host, using different port numbers. We also created a 
log of all messages exchanged between replicas that, given 
the common clock, allowed us to reason about the behavior 
of the system. All further observations are based on this 
groundwork. 

2.3 Authenticators and Erratic Recovery Behavior 

In an attempt to closely monitor and better understand the recovery
process, we stopped and restarted a replica, using the
default optimal configuration. We immediately witnessed erratic
behavior in the recovery process, which started and resynchronized
the state to the latest checkpoint, but was unable
to execute the few requests remaining in the log after
that point, because they failed the authentication test. What
we found after further investigation was that the use of authenticators,
introduced for efficiency, impeded the recovery
process, because the transient state of the restarted replica
had no record of the authenticators to use for validating
client requests. The solution the existing system implements
is the blind retransmission of the authenticators from each
node to all replicas, based on a timer. This way, once the recovering
replica receives the authenticators of the clients, it
will be able to resume the recovery process from the next
checkpoint. The only way to shorten this
service interruption is to reduce the authenticator retransmission
timeout, which results in increased load for the network.
We investigated other solutions, including on-demand
retransmission of the authenticators; we did not pursue this,
however, because retransmissions could introduce denial-of-service
vulnerabilities, as a faulty replica could simply bombard
the clients with authenticator retransmission requests.

2011/10/24

2.4 PBFT Behavior on UDP Packet Loss 

For the PBFT library, a Byzantine fault is
any possible fault, including an error as trivial as a UDP
packet loss. This creates interesting behaviors. We observed
that UDP packets were indeed lost in our experiments, even
on the loopback interface, due to congestion caused by
stress-testing the system. The impact of this is profound,
as such an error will leave a replica lagging behind in transaction
execution and will cause the recovery process to commence
at the next checkpoint. Unfortunately, although this
approach is theoretically very elegant, it is unacceptable for
a production environment to lose nodes to such trivial
errors.

One of the optimizations described above, regarding the 
special handling of big requests, combined with a trivial 
UDP packet loss, can greatly affect the robustness of the 
system. In this case, big requests are multicast to all replicas 
only once, from the client. The primary will then use only 
the digest of the request body for further communication 
with the rest of the replicas. Consider what happens if one 
of the packets traveling from the client to one of the replicas 
is dropped on the way. All replicas will begin the three- 
phase protocol to commit and execute the request, but when 
execution time comes, the replica that missed the request 
body will be unable to execute, and will be stuck at this point 
until the next checkpoint arrives and the recovery process 
kicks in. For a request not marked as big, though, the process
is different and more stable. Here, if the request from the
client to the primary is dropped, the client will time out
and retransmit the request, resulting in a request execution
workflow where either all replicas participate or none does.
Even in this case, a replica-to-replica packet loss would
again result in interruption of service for one of the replicas,
but perhaps in some environments, one can assume this to be
less frequent than client-to-replica packet loss.

2.5 PBFT Handling of Non-determinism 

In the original PBFT implementation, a feature was in- 
troduced to resolve the non-deterministic characteristics 
of most applications. The primary makes an application- 
specific up-call, which returns a set of values that are at- 
tached by the primary to the Pre-Prepare message. This 
data becomes common to all replicas executing the request, 
thus providing deterministic behavior on request execu- 
tion. Subsequent work on the PBFT protocol added 
an extra mechanism to validate this data on each replica. A 
new application-specific up-call was established that, when 
passed the non-deterministic data, is expected to validate it 
and return success or failure. The idea is, for example, that 
the primary attaches the system clock to the Pre-Prepare 
message, and each replica validates the passed value against 
its own clock to make sure it is appropriate. 

However, the handling of non-determinism described
above introduces a subtle issue. It is not always clear how the
application can validate the non-deterministic data passed to
it via the new up-call. The hurdle for such a validation is
the instant in time at which it is supposed to happen. In the normal,
fault-free lifetime of a request, the validation happens
as soon as the Pre-Prepare message is received, which is
almost immediately after it is transmitted. Thus validating
against a time delta is viable. However, when a request is
replayed from the log during recovery, the time drift can be
quite large and validating using a time delta will fail and impede
the recovery process. A solution to this issue would be
to differentiate message processing for the recovery process
and completely skip non-deterministic data validation during
recovery. This, however, is again a non-trivial exercise, as
message execution in PBFT is completely orthogonal to its
origin.
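
The timestamp example and its failure during replay can be sketched as follows. The function names, the tolerance value, and in particular the `replaying` flag are hypothetical; as noted above, the PBFT library does not expose the origin of a message to the code that executes it, so such a flag would have to be added.

```python
MAX_CLOCK_SKEW = 2.0  # seconds; hypothetical tolerance


def propose_nondet(primary_now: float) -> float:
    """Primary side: the value attached to the Pre-Prepare (here, its clock)."""
    return primary_now


def validate_nondet(proposed: float, local_now: float,
                    replaying: bool = False,
                    max_skew: float = MAX_CLOCK_SKEW) -> bool:
    """Replica side: accept the primary's timestamp if it is close to ours.

    When a request is replayed from the log, the local clock has moved on,
    so the delta check is skipped instead of rejecting a valid request.
    """
    if replaying:
        return True
    return abs(local_now - proposed) <= max_skew


stamp = propose_nondet(100.0)
print(validate_nondet(stamp, 100.5))                  # normal case
print(validate_nondet(stamp, 160.0))                  # replay without the flag
print(validate_nondet(stamp, 160.0, replaying=True))  # replay with the flag
```

Skipping validation during replay trades the check for progress; it relies on the logged request having already passed validation when it was first ordered.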

3. PBFT Deployment Drawbacks and 
Obstacles 

In this section, we present in detail a number of the obstacles 
we encountered in trying to leverage the PBFT approach and 
implementation for our Internet e-voting service. While some
of the details may seem pedantic and low-level, we include
them here to give the reader a clear idea of the kinds of issues
an application developer must face in porting an application
to a BFT version.

3.1 Dynamic client membership 

The existing PBFT protocol and implementation assume
completely static membership where each node in the sys- 
tem, client or replica, needs a priori knowledge of the ad- 
dress, port, and public key for every other node. For many 
applications, particularly Internet service applications with 
a large number of clients, such a closed system does not 
suffice. Our goal is to remedy this to enable clients to join
and leave the replicated service dynamically, while letting 
the replicas remain statically bound to one another. The end 
result is that clients only need information regarding repli- 
cas, but no information regarding other clients, allowing for 
a more scalable deployment. 

To achieve support for dynamic client membership, the 
replicas need to identify each client in an identical (deter- 
ministic) manner. This leads us to store the client identifiers 
in the shared state of the service (i.e., in the continuous mem- 
ory region). When a client requests to join or leave the group, 
each replica needs to process the request using the same ver- 
sion of the shared state. Thus, all such client requests need 
to be totally ordered, at least with respect to one another. 

We define two special system requests, namely a Join 
and a Leave, which follow the same life-cycle as all other 
application-level (client) requests. This results in a single to- 
tal order across all requests, application or system, fulfill- 
ing our requirement. The Join and Leave system requests 
are processed by the middleware library and are invisible to 
the application. 

We introduce a level of indirection between what the
PBFT library already uses as a node identifier and what the
client reception module assigns to new clients, for efficiency
of message evaluation. Instead of using a single address
range of {0..max_clients}, an arbitrary identifier is assigned
to each new client and a table maps this number to the index
in the array of client and server node entries. This way,
when a client request arrives, the system first checks to see
if the identifier exists in the redirection table before going
into the more lengthy process of verifying its signature or
authenticator.
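
A minimal sketch of such a redirection table follows. The class and method names are hypothetical, and the identifier is passed in by the caller because the text does not say how it is derived; whatever scheme is used, every replica must arrive at the same value.

```python
class ClientTable:
    """Maps arbitrary client identifiers to slots in a bounded node array."""

    def __init__(self, max_clients: int):
        self.slots = [None] * max_clients   # node entries, indexed 0..max_clients-1
        self.index_of = {}                  # client identifier -> slot index

    def add(self, client_id, entry) -> bool:
        """Store entry in a free slot; return False if the id is taken or the table is full."""
        if client_id in self.index_of or None not in self.slots:
            return False
        slot = self.slots.index(None)
        self.slots[slot] = entry
        self.index_of[client_id] = slot
        return True

    def lookup(self, client_id):
        """Cheap pre-check, done before any signature or MAC verification."""
        slot = self.index_of.get(client_id)
        return None if slot is None else self.slots[slot]

    def remove(self, client_id) -> None:
        slot = self.index_of.pop(client_id, None)
        if slot is not None:
            self.slots[slot] = None
```

A request carrying an unknown identifier is rejected by `lookup` with a single dictionary access, so no cryptographic work is spent on it.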

Originally, our idea was for the client to multicast a simple
Join system request to all replicas, carrying its address,
its public key and a random nonce, signed with its private
key. Each replica would assign the same new identifier and
transmit it back in the reply. However, nothing stops a malicious
client from initiating an unbounded number of connections,
using phony addresses, thus exhausting the bounded
maximum number of node entries in each replica. To address
this vulnerability, we improve the connection process
by splitting the Join operation into two phases. In the first
phase, the client submits its data as previously described and
awaits a challenge. Upon receiving the challenge, the client
calculates a response and transmits it back to the replicated
service in the second phase of the Join. Only then will the
replica add the client to the system as a full member. This approach
ensures the client indeed owns the address it claims,
as receiving the challenge is imperative to compute the response.
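
The two-phase idea can be sketched from the point of view of a single replica. The text does not specify how the challenge or the response is computed, nor how replicas agree on them, so everything below is an assumption: the challenge is derived from a replica-held secret and the join data, the response is a hash over the challenge and the nonce, and all function names are hypothetical.

```python
import hashlib
import hmac


def make_challenge(secret: bytes, address: str, nonce: bytes) -> bytes:
    """Phase 1 (replica): derive a challenge and send it to the claimed address.

    Nothing is stored per client here, so a flood of phony first-phase
    requests does not consume node entries.
    """
    return hmac.new(secret, address.encode() + nonce, hashlib.sha256).digest()


def make_response(challenge: bytes, nonce: bytes) -> bytes:
    """Phase 2 (client): computable only after the challenge was received."""
    return hashlib.sha256(challenge + nonce).digest()


def admit(secret: bytes, address: str, nonce: bytes,
          response: bytes, members: set) -> bool:
    """Phase 2 (replica): recompute the expected response; allocate on success only."""
    expected = make_response(make_challenge(secret, address, nonce), nonce)
    if hmac.compare_digest(response, expected):
        members.add(address)
        return True
    return False
```

A client that claims an address it does not own never sees the challenge sent there, so it cannot produce a response that `admit` accepts for that address.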

We also add an application-level identification buffer to
the Join message. This buffer is passed to the application for
authorization. It might include, for example, an encrypted
user id and password. The application then returns an identifier
to be associated with this client (such as the user id).
The middleware library will then guarantee that only a single
session can be active at a time for this specific identifier,
by terminating all previous sessions when a new one is established.
This way, even in a distributed denial of service
attack, the attacker can only establish as many sessions as
the number of credentials they have managed to obtain.

The Leave system request is much simpler as it simply 
instructs each replica to remove the client from its internal 
tables. All further communication with the service is prohib- 
ited for this client. 

We need timeouts to enforce cleanup of stale sessions 
once the node structures are full. To achieve some common 
ground regarding time across all replicas, all requests are 
timestamped with the time of the primary; when each request 
is executed, its timestamp is recorded for each client. When 
a join request arrives that cannot be serviced because the 
client/server node table is full, a cleanup process is started 
that will locate all clients with a last executed request older 
than the current join request minus a configurable threshold. 
All such sessions are cleared to make room for the new 
connection. If no such stale sessions are found, the new Join 
request is denied. 
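
The cleanup rule can be sketched as below, assuming integer timestamps assigned by the primary. The class `NodeTable` and the parameter `stale_after` are illustrative names, not identifiers from the PBFT code base.

```python
class NodeTable:
    """Hypothetical bounded client table with timestamp-based cleanup."""

    def __init__(self, capacity, stale_after):
        self.capacity = capacity
        self.stale_after = stale_after   # configurable threshold
        self.last_executed = {}          # client -> primary timestamp

    def record_request(self, client, primary_ts):
        # Called when a request of a joined client is executed.
        if client in self.last_executed:
            self.last_executed[client] = primary_ts

    def join(self, client, primary_ts):
        if len(self.last_executed) >= self.capacity:
            # Table full: evict every session older than the cutoff.
            cutoff = primary_ts - self.stale_after
            stale = [c for c, ts in self.last_executed.items() if ts < cutoff]
            if not stale:
                return False             # nothing to evict, deny the Join
            for c in stale:
                del self.last_executed[c]
        self.last_executed[client] = primary_ts
        return True
```

Since every replica sees the same primary-assigned timestamps in the same order, each replica running this logic would evict the same sessions.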

The Join process is depicted in the UML sequence diagram
shown in Figure 2.

Figure 2. PBFT dynamic client join sequence diagram

Note we have enhanced the PBFT protocol with support 
for dynamic client membership without changing the inher- 
ent properties and message exchanges of the protocol. Thus, 
our changes do not affect the safety and liveness guarantees 
offered by PBFT.

3.2 A higher level state abstraction

In a replicated state machine, the term 'state' is an abstract 
definition of the persistent workspace of the application.
PBFT defines state to be a continuous virtual memory region 
where both the application and the middleware library store 
their non-transient state, in contiguous non-overlapping par- 
titions. The middleware library has full access to this mem- 
ory region while the application code is not executing, since 
it is responsible for managing replication and synchroniza- 
tion of this state across replicas. The application, on the other 
hand, has free read access to it, but is required to notify the 
library before making changes to any region, thus permitting 
copy-on-write optimizations of state synchronization. 
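
The notify-before-modify contract can be illustrated with a toy model. The page size, the class `ReplicatedState`, and the method `notify_modify` are assumptions for illustration; the actual PBFT interface is a C library operating on raw memory.

```python
PAGE_SIZE = 4096  # illustrative page granularity


class ReplicatedState:
    """Toy model of a state region with modification notifications."""

    def __init__(self, num_pages):
        self.memory = bytearray(num_pages * PAGE_SIZE)
        self._saved = {}  # page index -> copy taken before first write

    def notify_modify(self, offset, length):
        # Save a copy of each touched page the first time it is
        # modified after a checkpoint (copy-on-write).
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if page not in self._saved:
                start = page * PAGE_SIZE
                self._saved[page] = bytes(self.memory[start:start + PAGE_SIZE])

    def write(self, offset, data):
        # A well-behaved application always notifies before writing.
        self.notify_modify(offset, len(data))
        self.memory[offset:offset + len(data)] = data

    def dirty_pages(self):
        return sorted(self._saved)

    def checkpoint(self):
        self._saved.clear()
```

Only the pages reported as dirty need to be considered when synchronizing state, which is why a write that bypasses `notify_modify` would silently corrupt replication in this model.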

While the above approach relieves the application con- 
siderably from having to deal with state synchronization, it 
creates a number of questions which the application devel- 
oper must face: What can a modern application do with just 
a pointer to a memory region? How is this state persistently 
stored on disk when the service stops? And how does the de- 
veloper avoid the havoc caused by a misbehaving application 
which fails to notify the library before modifying memory? 

To answer these questions in a satisfactory manner, we 
decided to adapt an embedded relational database engine, 
to intervene between the PBFT middleware library and the 
application. This way, the application will have SQL-level 
access to its state and the embedded engine will take care of 
interfacing with the PBFT library to satisfy its requirements. 

In our search for an embedded relational database engine, 
the major feature we were after was storage of data in a sin- 
gle file, which we could map to virtual memory. We selected 
SQLite [1] because it exhibits this feature and because it is 
mature and widely deployed. SQLite is an embedded, in- 
process library that implements a self-contained relational 
database engine using SQL as its command language and a 
C call level interface for the application. It stores all data ob- 
jects in a single database file that is binary compatible across 
machine architectures (endianness) and word sizes. 

In SQLite's quest to be a multi-platform product, the 
authors have defined an abstraction layer called VFS (Vir- 
tual File System), that sits between the relational engine 
and the operating system. By hooking into this subsystem, 
we not only can manage memory mapping and perform 
PBFT-required memory modification notifications, but also 
re-implement non-deterministic functions, such as system 
time and random values, by using the upcalls described in 
Section 2. Interaction with VFS is illustrated in Figure 3.
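
The effect of replacing non-deterministic functions can be demonstrated in a few lines. As far as we are aware, Python's standard `sqlite3` module does not allow registering a custom VFS, so this sketch swaps in a different technique: user-defined SQL functions that return values assumed to have been agreed upon by the replicas. The function names `pbft_now` and `pbft_random` and the `votes` schema are hypothetical.

```python
import sqlite3


def open_replica_db(agreed_time, agreed_random):
    """Open an in-memory database whose time and random sources are fixed."""
    conn = sqlite3.connect(":memory:")
    # Stand-ins for the upcalls that deliver agreed-upon values.
    conn.create_function("pbft_now", 0, lambda: agreed_time)
    conn.create_function("pbft_random", 0, lambda: agreed_random)
    conn.execute(
        "CREATE TABLE votes (voter TEXT, vote TEXT, ts INTEGER, rnd INTEGER)"
    )
    return conn


def cast_vote(conn, voter, vote):
    """Insert one vote and return the resulting table contents."""
    conn.execute(
        "INSERT INTO votes VALUES (?, ?, pbft_now(), pbft_random())",
        (voter, vote),
    )
    conn.commit()
    return conn.execute("SELECT * FROM votes").fetchall()
```

Two replicas opened with the same agreed values produce identical rows, whereas calls to the local clock or random generator would make their states diverge.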

SQLite uses two disk files to manage the database, for re- 
liability reasons. The first file is the actual database, which 
we map to virtual memory. The second file is the rollback 
journal (or write-ahead log, in a different mode of operation),
which is used to roll back failed transactions. We left this
second file on disk, since it allows the engine to recover in
the case of system failure and it is not actually part of the
application state. In any case, the database file is
synchronized with its disk image on transaction commit.

Figure 3. SQLite with its VFS inside a PBFT application

We gain many advantages with this approach. First, a 
committed transaction will be durable, even in the case of 
a system crash. That is, when the replica node restarts op- 
eration, its state will include the last committed transaction, 
and PBFT recovery will commence from this point. Second, 
even if the node is to be removed from the replicated service, 
its data will be usable on its own, being just another database 
file. Moreover, an uncommitted transaction will be rolled 
back on the next attempt to access the database file, from 
the replicated service or on its own. These advantages are 
simply the by-product of the ACID semantics that SQLite 
provides and excellent reasons why developers will likely 
want to take advantage of it. 

One obstacle we faced was that, while SQLite can freely 
manage the growth and shrinkage of its database file, PBFT 
is not so permitting, because it requires knowledge of the 
size of the memory region that represents the state, during 
its initialization. To alleviate this, we use a sparse file that is 
defined to be a large enough size on initialization, without 
actually occupying that space on disk, a solution that is 
reasonable in modern 64-bit operating systems with large 
virtual memory address ranges. 
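
The sparse-file approach can be sketched as follows. The 64 MiB size and the function name `create_sparse_state` are illustrative assumptions, and whether the file is actually stored sparsely depends on the file system.

```python
import mmap
import os
import tempfile

STATE_SIZE = 64 * 1024 * 1024  # illustrative fixed size of the state region


def create_sparse_state(path, size=STATE_SIZE):
    """Create a file of the full size up front and map it into memory."""
    with open(path, "wb") as f:
        # Extending by truncate allocates no data blocks on file
        # systems that support sparse files.
        f.truncate(size)
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, size)  # shared, writable mapping
    finally:
        os.close(fd)  # the mapping keeps its own descriptor


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "state.db")
    region = create_sparse_state(demo_path)
    region[0:4] = b"PBFT"
    region.flush()
    print(len(region), os.path.getsize(demo_path))
    region.close()
```

The mapping has a fixed length known at initialization, which is what PBFT requires, while the database inside it is free to grow up to that bound.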

The application code now simply passes the name of 
the database file to the PBFT initialization function re- 
sponsible for starting up the replica server and setting up 
any data structures needed by the middleware. The func- 
tion returns to the application code a standard SQLite 
database handle. Using this handle, the application can 
call standard SQLite library functions (e.g. sqlite3_exec, 
sqlite3_prepare_v2, sqlite3_step) to access the database while 
executing during the appropriate PBFT upcall. This way, an 
application already using SQLite is immediately portable 
to the PBFT middleware with only minor changes to the 
initialization code. 
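
From the application's point of view, the resulting programming model looks roughly like the sketch below. It uses Python's `sqlite3` module as a stand-in for the C interface; `pbft_init_replica` and `execute_upcall` are hypothetical names, and the real initialization function would also start the replica and install the VFS hooks.

```python
import sqlite3


def pbft_init_replica(db_path):
    """Hypothetical stand-in for the middleware initialization call."""
    # In this sketch we simply hand back an ordinary database handle.
    return sqlite3.connect(db_path)


def execute_upcall(db, request):
    """Application code run for each ordered request."""
    key, value = request
    with db:  # commit on success, roll back on error
        db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
        db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
    return db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()[0]
```

The application body is ordinary SQLite code; only the call that obtains the handle differs from a non-replicated program.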

3.3 Remaining issues 

We now describe a number of remaining issues we encoun- 
tered in the process of applying the PBFT approach to our 
electronic voting application service.

3.3.1 Cryptography

Applications requiring strong cryptography, such as private 
key generation and storage on the server side of the appli- 
cation, are not well supported by the current PBFT imple- 
mentation. For key generation, strong random values are 
required. Unfortunately, even if the primary obtains such 
strong randomness from its local OS services, for example 
via /dev/random, there is no way for the remaining replicas to
verify such values, precisely because they are random. Because
of this, an adversary can obtain access to one of the execution
replicas, wait until it becomes
the primary and use predetermined values instead of random 
values. In this manner, the adversary can trigger the genera- 
tion of well-known private and public keys and thus violate 
confidentiality. To alleviate such attacks, one solution would 
be to enforce a threshold signature scheme ifisll for such au- 
thentication requirements, provided for by the middleware 
library. In such a scheme, private key information for each 
replica would never be transmitted over the network, as it 
would not be stored in shared state. In an (f + 1, n) threshold
signature scheme (where n = 3f + 1), the set of n replicas
would collectively generate a digital signature despite up to
f Byzantine faults. Of course, the PBFT protocol would have
to be modified to provide for such cryptographic operations. 
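
The (f + 1, n) threshold property can be illustrated with Shamir secret sharing. Note that this is secret sharing, not a threshold signature scheme: it only shows that any f + 1 of the n = 3f + 1 shares suffice to recover a value. The field prime and the function names are our own choices for the sketch.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime, used as the field modulus


def split(secret, f):
    """Split `secret` into n = 3f + 1 shares with threshold f + 1."""
    n, threshold = 3 * f + 1, f + 1
    # Random polynomial of degree f whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, poly(x)) for x in range(1, n + 1)]


def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With f = 1, the secret is split among 4 replicas and any 2 of them can reconstruct it, so a single faulty replica can neither learn the value alone nor block the others.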

Another confidentiality issue is the matter of protecting 
storage of sensitive information. This has been studied by 
Yin et al. [33], who propose separating the agreement part
of the PBFT protocol from the execution part, while also
adding an intermediate cluster of 'privacy firewall' nodes.
In this layout, 3f + 1 agreement nodes receive the client
requests and forward them to 2f + 1 execution nodes for
execution. To ensure that a faulty execution node cannot
disclose sensitive information, a privacy firewall of
(h + 1) × (h + 1) nodes, arranged in h + 1 rows and h + 1
columns, is positioned between the agreement and execution
clusters, which tolerates up to h faulty firewall nodes. This
obviously increases both
deployment complexity and request execution latency. 

3.3.2 Stateless applications only 

The current implementation of the PBFT protocol purposely 
ignores the notion of client-specific state. This, however, 
severely limits the target applications to those that are either 
stateless by nature, or manage session state on their own us- 
ing their global state abstraction; the latter will need to pass 
session identifiers inside the request and reply bodies, with- 
out any assistance from the middleware library. This is not 
an inherent limitation of the State Machine Replication ap- 
proach. It is simply a consequence of the lack of appropriate 
mechanisms in the PBFT library. With our addition of ap- 
plication level sign-on messages to the protocol, resulting in 
identification of specific sessions, a library-level subsystem 
can be developed that will map parts of the state to a specific 
session. This would enable easier porting of stateful applica- 
tions to the BFT world. 



3.3.3 Web applications 

Our end goal is to provide a web application to end users, 
which provides them hassle-free access to the server coun- 
terpart of the evoting service. We aim to achieve this without 
sacrificing BFT semantics. To this end, the browser-hosted 
part of the application, typically written in JavaScript, will 
have to directly access each and every replica. This commu- 
nication however cannot be carried over UDP because this 
protocol is not allowed in the JavaScript runtime environ- 
ment. Moreover, binary messages are highly inconvenient 
in this context. Higher level protocols, such as WebSocket, 
and structures like JSON or XML need to be used. Support 
for these technologies needs to be incorporated in the mid- 
dleware library, a task not so trivial because of the need to 
switch from a point-to-point message-based communication 
to a connected channel-oriented communication. Addition- 
ally, cryptographic functions will need to be available in the 
browser-hosted client part, which requires transitioning from 
Rabin to more widely available cryptosystems, such as RSA. 

Additionally, we aim to have the replicas located in different
physical locations, to obtain real independence of faults
caused by network partitions. This requirement dictates op- 
eration in a Wide Area Network environment, where the 
quadratic message complexity of PBFT will most probably 
prove costly regarding request latency. Although we tried to 
simulate a WAN deployment scenario using BFTsim ifsoll . 
the simulator could not scale to a large enough number of 
nodes (> 100) to obtain meaningful results. This issue is al- 
ready studied in jstl, though no open source implementation 
is readily available. 

Summary: The above issues can be overcome, but re- 
quire a significant amount of engineering effort. An applica- 
tion developer wanting to leverage and deploy PBFT now is 
likely to be unwilling to invest the time and effort required 
to retrofit the PBFT approach to match the needs of his/her 
application. 

4. Evaluation 

In this section we present empirical measurements of the 
PBFT library, both with and without our modifications sup- 
porting dynamic client and seamless state management for 
applications requiring ACID semantics provided by a legacy 
database. 

We test the PBFT library and our modifications to it on a 
cluster of 8 machines connected with a 1 Gbit/s Ethernet switch.
The first four machines are Intel Xeon E5620 at 2.40 GHz
under CentOS 5.5 with Linux kernel 2.6.18-194. The re- 
maining four are Intel Core 2 Duo E6600 at 2.40 GHz un- 
der Debian 5.0 with Linux kernel 2.6.26. All eight machines 
run 64 bit versions of their corresponding operating systems. 
Ping roundtrip time is measured at 134-183 nanoseconds between
all hosts. Bandwidth is measured, using iperf, at 938 Mbits/sec.
For all tests, we generate a server and client executable
using a particular library configuration set so as to measure
the effect of turning on or off a particular optimization
and/or modification. We designed the client to connect to the
library and wait for a signal. On signal reception, it
records the current time, starts its operation and then mea- 
sures and reports elapsed time. To coordinate all processes 
running on different hosts while at the same time collect- 
ing and aggregating measurements, we implemented a test 
framework using Python and netcat, where the latter runs 
on each host and allows a single controller to submit scripts 
(i.e., experiments) and collect the results. 
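
The wait-for-signal measurement pattern can be sketched in a single process. Our framework coordinates separate hosts with Python and netcat; the sketch below instead uses threads and a `threading.Event` as a stand-in for the network signal, and the function names are illustrative.

```python
import threading
import time


def timed_client(start_signal, operation, results, name):
    """Wait for the start signal, run the operation, record elapsed time."""
    start_signal.wait()
    began = time.perf_counter()
    operation()
    results[name] = time.perf_counter() - began


def run_experiment(operations):
    """Start all clients, release them together, and collect timings."""
    start_signal = threading.Event()
    results = {}
    workers = [
        threading.Thread(target=timed_client, args=(start_signal, op, results, name))
        for name, op in operations.items()
    ]
    for worker in workers:
        worker.start()
    start_signal.set()  # the controller's signal to begin
    for worker in workers:
        worker.join()
    return results
```

Releasing all clients from one signal keeps their measurement windows aligned, so per-client elapsed times can be aggregated into a throughput figure.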

4.1 Non-SQL Experiments 

We first conduct an experiment without the SQL state ab- 
straction modifications we made in order to benchmark the 
plain PBFT implementation. Our goal is to measure the impact
on system throughput of turning on/off the optimizations
described in Section 2. Recall that the use of certain
optimizations (such as the use of MACs and special han- 
dling of big requests) increases performance at the cost of 
decreased robustness (e.g., slow recovery) of the system. 



We generate and test a series of PBFT library configurations,
shown in Table 1. The first configuration is the default
configuration preferred and recommended by Castro,
with all optimizations enabled, including the use of MACs, 
special treatment of all requests as big requests, and request 
batching. Since batching is the only optimization for which 
we did not observe faulty behavior, we isolate it and test all 
other combinations of configurations with batching enabled 
and disabled, to show its impact. The last four rows of
Table 1 depict the most robust configurations (use of MACs
and big request handling turned off). Since our particular ap- 
plication has stringent security and reliability requirements, 
we choose to measure the impact of adding support for dy- 
namic client management using these configurations. We be- 
lieve other Internet service applications with similar high se- 
curity and robustness needs would need to run the PBFT li- 
brary using these configurations. The client and server pro- 
grams built to measure throughput transmit null requests and 
responses of varying sizes, of 256, 1024, 2048 and 4096 
bytes. We test the system using 12 clients spread evenly 
across 4 machines while being serviced by 4 replicas, each 
running alone on a single host. In all cases, IP-level mul- 
ticasting was turned off, as the networks we are targeting 
(WANs) do not support it. 

The results for varying request and response sizes are 
similar, so for brevity we show a representative plot, for a
size of 1024 bytes, in Figure 4.

From Table 1 and Figure 4, it is clear that the first
configuration, which is the default configuration of the PBFT
library with all optimizations turned on, achieves the best
throughput performance. In our experiments, this configuration
achieves approximately 17000 null operations per second, while
for the most robust configurations the throughput drops to
about 1000 null operations per second.

Figure 4. PBFT tests

We observe that disabling the batching optimization seri- 
ously affects performance when using MACs. When switch- 
ing to signing with private keys, the delay introduced is so 
large that batching can no longer assist in any way. More- 
over, when disabling big request handling, performance 
drops to 18% of the optimal, while disabling the use of 
MACs causes performance to drop to 7.5% of the optimal.
Disabling both big request handling and MAC use causes
performance to drop to 6% of the optimal. While we observe a
difference in performance amongst these configurations, where
some subset of optimizations is turned off, the bottom line is
that performance takes a big hit when turning off any of the
optimizations. However, for an application
with high security requirements, we conjecture robustness is 
favored over performance. 
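
To make the percentages concrete, the short calculation below converts them into approximate throughput. The figures are derived from the rounded numbers quoted in the text (about 17000 null operations per second for the default configuration), so they are estimates rather than measured values.

```python
BEST_TPS = 17000  # approximate throughput of the default configuration

# Fraction of the optimal throughput retained, as quoted in the text.
RELATIVE = {
    "no big request handling": 0.18,
    "no MACs": 0.075,
    "neither": 0.06,
}


def estimated_tps(best=BEST_TPS):
    """Approximate null operations per second for each configuration."""
    return {name: round(best * frac) for name, frac in RELATIVE.items()}
```

The estimate for the configuration with both optimizations disabled, roughly 1020 operations per second, agrees with the figure of about 1000 reported for the most robust configurations.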

We evaluate the impact on performance of adding sup- 
port for dynamic client management using the most robust 
configurations. The performance decrease is about 0.5% (988 vs
992 TPS), which is negligible. This decrease is attributable
to the cost of accessing the redirection table that converts
assigned client identifiers to indexes in the
tables tracking participating nodes (clients and servers). We 
emphasize that the above tests are artificial because they are 
testing "null" operations. The software on the replica spends 
no time executing application code; it simply manages the 
network protocol. The large majority of prior BFT studies 
present throughput in terms of null operations per second. 
This is understandable as the focus is on providing a base- 
line benchmark against which varying BFT protocols can 
be compared, but is not helpful to the application developer 
who needs to understand how the system would behave us- 
ing real application requests. 

4.2 SQL state abstraction experiments 

In this subsection, we evaluate the performance of adding 
seamless state management for applications requiring ACID 
semantics provided by a legacy database. Null operations are 
thus not realistic to use in this setting. For our client appli- 
cation request we choose the insertion of a single row into 
a database table. This is the operation our evoting service 
must perform to record a user's vote in an ongoing election.

Table 1. PBFT library configurations we test. TPS is
transactions per second, where a transaction is simply a null
request. Null request and null response sizes are 1024 bytes.

The tuple inserted into the database includes a simple key
and value text (representing voter identity and accompany- 
ing vote), in addition to a timestamp and a random value. We 
purposefully added the timestamp and random value to test 
that replies are indeed identical across all replicas. For this 
experiment, we enabled request batching and varied turn- 
ing on and off the remaining options (use of MACs, big 
request handling, and support for dynamic clients). ACID 
semantics are provided using the rollback journal mode of 
SQLite. Throughput performance, measured as database insertion
transactions per second, is illustrated in Figure 5.

Figure 5. PBFT + SQL benchmark

In this experiment, the big request handling optimiza- 
tion pays no dividends because the system now spends time 
executing a real, non-null request which requires access- 
ing the hard disk. This dominates the overall request exe- 
cution lifetime. At any rate, the most robust configuration 
with dynamic clients enabled is now at 43% of the best 
(sta_mac_noallbig). Since disk access is a big factor in this 
experiment, we perform two more experiments to isolate its 
impact. In these experiments, we measure the most robust 
configuration (where the use of MACs and big request han- 
dling are disabled) with dynamic clients and ACID seman- 
tics (as above) and we measure another configuration with- 
out ACID semantics (no rollback journal and no flushing 
to disk on each operation). The ACID version achieves 534 TPS
while the No-ACID one scores 1155 TPS, an approximately 2x
performance boost.
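
The two storage settings compared here can be approximated with standard SQLite pragmas. The sketch below measures plain single-row inserts outside of PBFT, so its numbers are not comparable to ours; the table layout and the function name `insert_rate` are assumptions. For reference, the measured ratio is 1155 / 534, or about 2.16.

```python
import os
import sqlite3
import tempfile
import time


def insert_rate(acid, rows=200):
    """Insert `rows` single-row transactions; return (row_count, rows/sec)."""
    path = os.path.join(tempfile.mkdtemp(), "bench.db")
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
    if acid:
        conn.execute("PRAGMA journal_mode=DELETE")  # rollback journal
        conn.execute("PRAGMA synchronous=FULL")     # flush on each commit
    else:
        conn.execute("PRAGMA journal_mode=OFF")     # no rollback journal
        conn.execute("PRAGMA synchronous=OFF")      # no flushing to disk
    conn.execute("CREATE TABLE votes (voter TEXT, vote TEXT)")
    began = time.perf_counter()
    for i in range(rows):
        conn.execute("INSERT INTO votes VALUES (?, ?)", (f"voter{i}", "yes"))
    elapsed = time.perf_counter() - began
    count = conn.execute("SELECT COUNT(*) FROM votes").fetchone()[0]
    conn.close()
    return count, rows / elapsed
```

Calling `insert_rate(True)` and `insert_rate(False)` on the same machine gives a rough sense of how much of the per-request cost is due to journaling and flushing alone.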

Summary: The optimizations turned on by default in the 
PBFT library lead to the high throughput numbers reported
in prior studies, but as we have shown in Section 2, using
some simple fault scenarios (such as UDP packet loss), the 
high performance numbers come at the cost of decreased ro- 
bustness of the system. Moreover, the performance numbers 
reported by a large majority of prior BFT studies are based 
on a metric of null operations per second. This is not a helpful
metric for the end-application developer, particularly for a
developer whose application makes use of a legacy database 
for ACID semantics. 

5. Related Work 

As cited in Section 1, since the seminal 1999 publication on
PBFT by Castro and Liskov [SJ, there has been a flurry of
research activity focused on improving the BFT middleware
cost ifTsi [32], [33] and robustness under both faulty servers
and faulty clients ifl [ Till. A majority of these systems
I2IL [32], [33] are direct descendants of the Castro
and Liskov PBFT system. Of all of these systems, the only 
codebase that has been made widely available and refined 
for several years is the PBFT system. For this reason, we 
have focused on this system. Since all PBFT descendants 
use the same codebase, the obstacles we encountered as 
application developers in using the PBFT system apply to 
its descendants as well.

We highlight, below, some related works that either di- 
rectly focus on bringing BFT systems closer to widespread 
deployment in real applications, or raise issues that affect 
the practical deployment of our (and other) security-critical 
applications. 



Wood et al. [32] write "no commercial data center uses
BFT techniques despite the wealth of research in this area" 
and posit that this is due to the high cost of replication re- 
quired by BFT protocols. They aptly point out that, for ap- 
plications such as web servers and database servers, it is the 
execution of client requests and not the agreement of request
ordering that dominates the performance of a BFT protocol.
They propose lowering the number of active execution replicas
to f + 1 by using virtual machines as execution nodes and ZFS
snapshots for quick state checkpointing. When the f + 1
replicas produce inconsistent replies, a paused execution node
is revived and starts executing requests immediately. The
middleware library fetches the state needed by
these requests on demand, to amortize the cost of state trans- 
fer. The paper claims that for applications running over a 
WAN environment, the time to perform state transfer is min- 
imal compared to WAN latencies. The focus of the paper 
is on reducing replication cost while maintaining good per- 
formance. While this is welcome for an application to be 
deployed in a data center, the paper does not address how 
the application developer can easily make use of the system, 
stating simply that applications must be rewritten to take ad- 
vantage of the system. 

Clement et al. [10] introduce UpRight, with the goal of
making it easy for application developers to convert a crash- 
fault tolerant application into a BFT application. It includes 
a number of state-of-the-art BFT techniques, including
separation of agreement from execution, insights from the
Aardvark protocol [11] on dealing with faulty clients and
alleviating denial-of-service attacks, as well as more flexible state
management (but not at such a high level as a relational en- 
gine). It also allows individual tailoring of crash-fault (Up) 
and arbitrary-fault (Right) tolerance. Unfortunately, it is still 
a work in progress, with several key features missing (e.g., 
view changes are unimplemented) and does not seem to have 
seen much development since March 2010, so it is not
helpful to a developer wishing to make use of BFT tech- 
niques now. 

Several attempts have been made to address the inability 
of replicated BFT services to mesh with the rest of the infras- 
tructure in today's multi-tier world. Merideth et al. lE6ll in- 
troduced Thema, which aims to mask BFT complexity from 
the application developer of web services based applications. 
An agent, visible to the unaffected outside world, plays the 
role of the client of a BFT system. Additionally, a proxy 
collects the multiple out-call requests from the replicas of a 
BFT system, and issues the actual out-call on behalf of them, 
returning the reply when available. Unfortunately, both the 
agent and the proxy are centralized components which are 
inappropriate for applications such as ours which require 
completely distributed design. 

Pallemulle et al. [if] focus on interoperability between BFT
systems while enforcing fault isolation, and introduce a whole
new protocol, named Perpetual, to achieve this. Sen et al.
[29], in a system called Prophecy designed to increase BFT
performance, introduce a Sketcher component that tries to
trade space for performance by storing a historical log of
request/reply pairs and allowing the application to
differentiate its requests, asking for possible log-based
replies. In its distributed incarnation, D-Prophecy, it is
simply an attempt to avoid re-execution of repetitive requests.
In the centralized one, Prophecy, the Sketcher completely
avoids BFT access but now becomes a centralized component.

Amir et al. [5] introduce Steward, a hierarchical BFT
architecture that tries to scale BFT to a wide-area network by
introducing an abstraction layer above PBFT using a Paxos- 
based protocol. It uses a threshold signature scheme to en- 
sure the recipient of a cross-domain message that enough 
replicas at the originating site agreed with the request. Both 
of these features are welcome to security-conscious Internet 
application services. Unfortunately, no source code is read- 
ily available. 

Vandiver et al. [31] and Garcia et al. [16] introduce
middleware for BFT database replication. Incorporating legacy
databases into a BFT system is important for a wide range 
of Internet applications. Unfortunately, both systems assume 
closed systems with a finite number of clients. The devel- 
oper of an Internet-facing application service still must deal 
with the issue of having end-user clients issue requests to 
the replicated database system. Either these systems need 
to provide support for dynamic client management or they 
must offload the Internet-facing application component ac- 
cepting customer/user requests to a centralized component, 
something not appropriate for our particular application. 

Finally, Guerraoui et al. [17] introduce a new abstraction
allowing for the construction of new BFT protocols with a 
fraction of the code currently necessary, thus vastly simpli- 
fying the BFT researcher's task. Having waded through the 
20,000 lines of PBFT code, we applaud this effort and em- 
phasize here the need to simplify the end application devel- 
oper's task as well. 



6. Conclusion 

This paper is a call to the systems community to look more 
closely at BFT from the perspective of a real-world application
developer. Our experience in trying to apply the PBFT
approach to a real-world application with stringent security 
and reliability needs reveals a slew of difficulties that the 
application developer must face if he wants to use even the 
mature, stable and well-tuned PBFT protocol and codebase 
upon which a large majority of subsequent BFT systems is 
based. While the difficulties encountered by the developer 
can be overcome, they require significant engineering ef- 
fort and have unclear performance ramifications. These two 
characteristics are likely to make the developer hesitant to 
invest the effort to leverage BFT techniques. 

The systems community prides itself on building and 
measuring real systems. If BFT systems are to see widespread
deployment in real-world systems, then the research com- 
munity needs to focus on the usability of BFT algorithms 
for real world applications, from the end-developer perspec- 
tive, in addition to continuing to improve BFT middleware 
performance, robustness, and deployment layouts.

Interestingly, we may find that the current BFT debate
evolves to resemble the microkernel debate [24], with
one camp advocating that the BFT concept is ultimately
impractical for real-world applications [7] and the other camp
advocating that it is not the concept that is impractical/faulty, 
but it is the implementation that is impractical/faulty. Build- 
ing a complete implementation that supports a real applica- 
tion for a long duration rather than for the length of time it 
takes to build and test a prototype implementation, that does 
not cut corners, that is not missing features, that does not 
make optimizations that break down in corner cases, that can 
be applied to more than one application, and that has good 
performance will go a long way to settling the debate. A tall 
order, for sure.